[SPARK-43427][PROTOBUF] spark protobuf: allow upcasting unsigned integer types #43773
Conversation
@rangadi do you mind taking a look? I've modified this to make it an option so that the default behavior doesn't change.
Hm, I wouldn't add this. It would be problematic when Spark has unsigned types. For the same reason, Parquet also doesn't support unsigned physical types for Spark.
It would be problematic when Spark has unsigned types. For the same reason, Parquet also doesn't support unsigned physical types for Spark.
Hey, I'm not sure I follow; do you mind explaining what you mean by this?
My goal here is to add an option allowing unsigned 32- and 64-bit integers coming from protobuf to be represented in a type that can contain them without overflow. I mention this in the description, but I actually modeled my code on how the Parquet code is written today, which I believe does this same thing by default.
Oh, it was added in #31921. My memory was out of date.
okay, cc @yaooqinn @cloud-fan
I am less certain about the overall Spark context, but for from_protobuf() this looks fine so far (since it is only enabled with an option).
Bump on this @rangadi, in case you get a chance 🙏
HyukjinKwon
left a comment
Seems at least coherent at a high level, although I don't know the details in the code.
@rangadi do you mind taking another look here?
rangadi
left a comment
Overall looks good to me. @HyukjinKwon should we wait for more reviews?
You can use protobufOptions.upcastUnsignedInts directly below, or add 'option' to this variable's name; that is more readable.
Sure, I've inlined it 👍
Would it be better to add a log here?
Oh, like a print statement? I'm not sure; would that be helpful?
HyukjinKwon
left a comment
I think it's fine. LGTM too
Signed-off-by: Parth Upadhyay <parth.upadhyay@gmail.com>
rangadi
left a comment
LGTM. Thanks @justaparth.
@HyukjinKwon please merge when you get a chance.
Merged to master.
What changes were proposed in this pull request?
JIRA: https://issues.apache.org/jira/browse/SPARK-43427
Protobuf supports unsigned integer types, including uint32 and uint64. When deserializing protobuf values with fields of these types, from_protobuf currently maps them to the Spark types IntegerType and LongType respectively. These are signed integer types, so this can lead to confusing results. Namely, if a uint32 value in a stored proto is above 2^31, or a uint64 value is above 2^63, its binary representation will contain a 1 in the highest bit, which when interpreted as a signed integer is negative (i.e. it overflows). No information is lost, as IntegerType and LongType contain 32 and 64 bits respectively, but the representation can be confusing.
In this PR, we add an option (upcast.unsigned.ints) to allow upcasting unsigned integer types into a larger integer type that can represent them natively. I made this an option so that it doesn't break any existing clients.
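The overflow described above is plain two's-complement reinterpretation, and can be illustrated outside Spark. The following is a standalone Python sketch (not the PR's Scala code): a large unsigned 32-bit value reads back as negative when viewed through a signed 32-bit type, while a wider type preserves it.

```python
import struct

def as_signed32(u: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<I", u))[0]

u = 2**31 + 5                 # a uint32 value above 2^31
print(as_signed32(u))         # negative when read back through a signed 32-bit type
print(u)                      # widened (e.g. into a 64-bit long), the value survives
```

The same effect occurs one size up for uint64 values above 2^63 read through a signed 64-bit long.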
Example of current behavior
Consider a protobuf message like:
If we compile the above and then generate a message with a value for val above 2^63, this generates the binary representation:
b'\x08\x81\x80\x80\x80\x80\x80\x80\x80\x80\x01'
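To see what that payload actually encodes, it can be decoded by hand. This is a sketch with a minimal base-128 varint reader (not the protobuf library's API), assuming the schema described above: a message whose field val is a uint64 at tag number 1.

```python
def decode_varint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode a protobuf base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        b = buf[pos]
        result |= (b & 0x7F) << shift
        pos += 1
        if not (b & 0x80):
            return result, pos
        shift += 7

payload = b'\x08\x81\x80\x80\x80\x80\x80\x80\x80\x80\x01'
key, pos = decode_varint(payload)
value, _ = decode_varint(payload, pos)
print(key >> 3, key & 0x7)   # field number 1, wire type 0 (varint)
print(value)                 # 2**63 + 1, which exceeds the signed 64-bit range
```

The decoded value is 2^63 + 1; reinterpreted as a signed 64-bit long, exactly that bit pattern is the negative number shown by from_protobuf.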
Then, deserializing this using from_protobuf, we can see that it is represented as a negative number. I did this in a notebook so it's easier to see, but it could be reproduced in a Scala test as well.
Precedent
I believe that unsigned integer types in Parquet are deserialized in a similar manner, i.e. put into a larger type so that the unsigned representation fits natively. See https://issues.apache.org/jira/browse/SPARK-34817 and #31921. So an option to get similar behavior would be useful.
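As a sanity check on the arithmetic that makes widening safe (this sketch only checks value ranges; it does not assume the PR's exact target-type mapping):

```python
INT64_MAX = 2**63 - 1    # largest value a signed 64-bit long can hold
UINT32_MAX = 2**32 - 1   # largest uint32 value
UINT64_MAX = 2**64 - 1   # largest uint64 value

# Every uint32 value fits in a signed 64-bit long without overflow.
assert UINT32_MAX <= INT64_MAX

# uint64 max does not fit in a signed 64-bit long...
assert UINT64_MAX > INT64_MAX
# ...but it has only 20 decimal digits, so a 20-digit decimal type can hold
# the full unsigned range exactly.
print(UINT64_MAX, len(str(UINT64_MAX)))
```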
Why are the changes needed?
Improve unsigned integer deserialization behavior.
Does this PR introduce any user-facing change?
Yes, adds a new option.
How was this patch tested?
Unit Testing
Was this patch authored or co-authored using generative AI tooling?
No